Partial Key Grouping: Load-Balanced Partitioning of Distributed Streams

Authors

  • Muhammad Anis Uddin Nasir
  • Gianmarco De Francisci Morales
  • David García-Soriano
  • Nicolas Kourtellis
  • Marco Serafini
Abstract

We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce PARTIAL KEY GROUPING (PKG), a new stream partitioning scheme that adapts the classical “power of two choices” to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly perfect load balance. This result translates into an improvement of up to 175% in throughput and up to 45% in latency when deployed on a real Storm cluster. PARTIAL KEY GROUPING has been integrated into Apache Storm v0.10.

Similar resources

Load Balancing for Skewed Streams on Heterogeneous Cluster

Streaming applications frequently encounter skewed workloads and execute on heterogeneous clusters. Optimal resource utilization in such adverse conditions becomes a challenge, as it requires inferring the resource capacities and input distribution at run time. In this paper, we tackle the aforementioned challenges by modeling them as a load balancing problem. We propose a novel partitioning st...

Load Balancing Strategies for Parallel SAMR Algorithms

Highly resolved solutions of partial differential equations are important in many areas of science and technology nowadays. Only adaptive mesh refinement methods reduce the necessary work sufficiently allowing the calculation of realistic problems. Blockstructured SAMR methods are well-suited for the time-explicit computation of large-scale dynamical problems, but still require parallelization ...

Spatial Partitioning for Parallel Hierarchical Radiosity on Distributed Memory Architectures

This paper presents an efficient, highly scalable implementation of the Hierarchical Radiosity Algorithm. We present a clever mapping of Hierarchical Radiosity to high-dimensional spaces that manifests a locality property, which can greatly reduce communication on parallel distributed memory architectures. We use a very simple dynamic spatial partitioning method to keep the mapping balanced. We...

Graph partitioning for scalable distributed graph computations

Inter-node communication time constitutes a significant fraction of the execution time of graph algorithms on distributed-memory systems. Global computations on large-scale sparse graphs with skewed degree distributions are particularly challenging to optimize for, as prior work shows that it is difficult to obtain balanced partitions with low edge cuts for these graphs. In this work, we attemp...

Journal:
  • CoRR

Volume: abs/1510.07623

Publication date: 2015